Abstract

User satisfaction is a common evaluation metric in task-oriented
dialogue systems, whereas tutorial dialogue systems are often
evaluated in terms of student learning gain. However, user
satisfaction is also important for such systems, since it may
predict technology acceptance. We present a detailed satisfaction
questionnaire used in evaluating the Beetle II system (REVU-NL),
and explore the underlying components of user satisfaction using
factor analysis. We demonstrate interesting patterns of interaction
between interpretation quality, satisfaction and the dialogue
policy, highlighting the importance of more fine-grained evaluation
of user satisfaction.

1  Introduction 

User satisfaction is one of the primary evaluation measures for
task-oriented spoken dialogue systems (SDS): the goal of an SDS is
to accomplish the task, and to keep the user satisfied, so that
they will want to continue using the system. Typically, the
PARADISE methodology (Walker et al., 2000) is used to establish a
performance function which relates user satisfaction measured
through questionnaires to interaction parameters that can be
derived from system logs. This function can then be used to better
understand which properties of the interaction have the most impact
on the users, and to compare different system versions.

In contrast, tutorial dialogue systems are typically evaluated in
terms of student learning gain, by comparing student scores on
standardized tests before and after interacting with the system.
This is clearly an important evaluation metric, since it directly
assesses the benefit students obtain from using the system.
However, it is also important to evaluate user satisfaction, since
it can influence students’ willingness to use computer tutors in
the long run. Thus, recent studies have looked at factors that
could influence user satisfaction in tutorial dialogue, such as
different tutoring policies (Forbes-Riley and Litman, 2011),
quality of speech output (Forbes-Riley et al., 2006), and students’
prior attitudes towards technology (Jackson et al., 2009).

Assessing user satisfaction, however, is not a straightforward
task. As we discuss in more detail in Section 2, user satisfaction
is known to be a complex multi-dimensional construct, composed of
largely independent factors such as perceived ease of use and
perceived usefulness. Therefore, questionnaires used for assessing
satisfaction need to be validated through user studies, and
different satisfaction dimensions should be assessed independently.
Accordingly, SDS researchers are now starting to use techniques
from psychometrics for this purpose (Hone and Graham, 2000; Möller
et al., 2007). However, user satisfaction studies in tutorial
dialogue currently rely on simple questionnaires adapted from
either task-oriented SDS or non-dialogue intelligent tutoring
systems (Michael et al., 2003; Forbes-Riley et al., 2006;
Forbes-Riley and Litman, 2011; Jackson et al., 2009), and these
questionnaires have not been validated for tutorial dialogue
systems.

In this paper, we make the first step towards developing a better
user satisfaction questionnaire for tutorial dialogue systems. We
present a user satisfaction evaluation of the Beetle II tutorial
dialogue system. Starting with a detailed user satisfaction
questionnaire, we employ exploratory factor analysis to discover a
set of dimensions for the students’ satisfaction with a
dialogue-based tutor. We then use the factors we derived to compare
user satisfaction between two versions of our computer tutor that
use different policies for generating the tutor’s feedback. We
investigate the relationships between the subjective satisfaction
dimensions and the objective learning gain metric for the two
systems. Finally, we carry out a more detailed investigation of our
prior results on the relationship between user satisfaction and
interpretation quality in tutorial dialogue. Our analysis also
provides insights for further improving the questionnaire we
developed and gives an example of how user satisfaction metrics
developed for task-oriented dialogue can be adapted to different
dialogue applications. It also opens new questions about how
different properties of the interaction affect user satisfaction in
tutorial dialogue, which can be investigated in future work.

The rest of the paper is organized as follows. We discuss the
approaches for assessing user satisfaction with SDS in Section 2.
In Section 3 we describe the Beetle II tutorial dialogue system
used in this evaluation. We describe our questionnaire design in
Section 4, and describe its use in the Beetle II evaluation in
Section 5. We conclude by discussing the implications of our
analysis for tutorial dialogue system evaluation in Section 6.

2  Background 

A  typical  approach  to  assessing  user  satisfaction  in 
dialogue  systems  is  collecting  user  survey  data  by 
asking  users  to  rate  their  agreement  with  statements 
such  as  “the  system  was  easy  to  use”.  In  the  simplest 
case  of  early  PARADISE  studies,  the  questionnaires 
contained  5  items  assessing  different  dimensions  of 
satisfaction,  which  were  then  summed  to  produce  a 
total  satisfaction  score. 

However, using simple questionnaires has drawbacks now recognized
by the SDS community. First, if individual questions are expected
to assess different dimensions of user satisfaction, they need to
be validated first, or else they may be ambiguous and mean
different things to different users. Second, summing or averaging
over questions measuring different satisfaction components may not
be the best approach, since it may conflate unrelated judgments
(Hone and Graham, 2000).

To address these problems, SDS researchers have started using more
complex questionnaires, where each underlying dimension of user
satisfaction is assessed through multiple questions. Factor
analysis is then used to determine which questions are related to
one another (and therefore are likely to be assessing the same
underlying satisfaction dimension), and to discard possibly
ambiguous questions. Then, the PARADISE methodology can be used to
relate different interaction parameters to individual components of
user satisfaction.

Several such studies have been conducted recently (Hone and Graham,
2000; Larsen, 2003; Möller et al., 2007; Wolters et al., 2009),
covering command-and-control and information-seeking dialogue. The
questionnaires in those studies contained 25 to 50 items, and
factor analyses typically resulted in 6- or 7-factor solutions,
with dimensions such as acceptability, affect, system response
accuracy and cognitive demand. The underlying factors found by
those analyses tend to match up well, but not to overlap perfectly.
In comparison, all user satisfaction questionnaires for tutorial
dialogue systems that we are aware of contain 10-15 items which are
either summed up for PARADISE studies, or compared individually to
track system improvement (Michael et al., 2003; Forbes-Riley et
al., 2006; Forbes-Riley and Litman, 2011; Jackson et al., 2009).

In this paper, we apply the more sophisticated SDS evaluation
methodology to the Beetle II tutorial dialogue system. We devise a
more detailed user satisfaction questionnaire using SDS
questionnaires for guidance and then apply factor analysis to
investigate the underlying dimensions. We compare our results to
analyses from two previous studies: SASSI (Hone and Graham, 2000),
which is a validated questionnaire intended for use with a variety
of task-oriented dialogue systems, and a more recent “modified
SASSI” questionnaire which is a version of SASSI adapted for use
with the INSPIRE home control system (Möller et al., 2007).
Henceforth we will refer to the latter as INSPIRE.


3  Beetle  II  Tutorial  Dialogue  System 

The goal of Beetle II (Dzikovska et al., 2010c) is to teach
students conceptual knowledge in the domain of basic electricity
and electronics. The system is built on the premise that
encouraging students to explain their answers and to talk about the
domain will lead to improved learning, a finding consistent with
analyses of human-human tutoring in several domains (Purandare and
Litman, 2008; Litman et al., 2009). Beetle II has been engineered
to test this hypothesis by eliciting contentful talk through
explanation questions.

The Beetle II learning material consists of two self-contained
lessons suitable for college-level students with no prior knowledge
of basic electricity and electronics. The lessons take 4 to 5 hours
to complete, and consist of reading materials and interactive
exercises. During the exercises, the students interact with a
circuit simulator, building electrical circuits containing bulbs,
batteries and switches, and using a multimeter to measure voltage.
Then the tutor asks students to explain circuit behavior, for
example, “Why was bulb A on when switch Y was open and switch Z was
closed?” In addition, at different points in the lesson the tutor
asks “summary” questions, asking students to define concepts such
as voltage, and verbalize general patterns such as “What are the
conditions that are required for a bulb to light?”. At present,
students use a typed chat interface to communicate with the
system.1

We built and evaluated two versions of the system (Dzikovska et
al., 2010a). The baseline non-adaptive tutor (BASE) requires
students to produce answers, but does not provide any remediation
and immediately states the correct answer. The fully adaptive
version (FULL) engages in dialogue with the student, and tailors
its feedback to the student’s answer by confirming its correct
parts and giving hints in order to help students fix missing or
incorrect parts. The FULL system generates feedback automatically
based on a detailed analysis of the student’s input, and is capable
of giving hints at different levels of specificity depending on the
student’s previous performance.

1 A speech interface is being developed, but typed communication is
common in online and distance learning, and therefore is an
acceptable choice for tutorial dialogue as well.


These two system versions were designed to evaluate the impact of
adaptive feedback (within the limitations of current language
interpretation technology) on student learning and satisfaction.
Our initial data analysis focused on the differences in student
language depending on the condition (Dzikovska et al., 2010a), and
on the impact of different types of interpretation errors on
learning gain and user satisfaction (Dzikovska et al., 2010b).
However, these initial results were based on an aggregate
satisfaction score obtained by averaging over scores for all
questions in our user satisfaction questionnaire. In this analysis,
we take a more detailed look at the different factors that
contribute to students’ satisfaction with the system, and their
relationship with learning gain and interpretation quality.

4  Data  Collection 

4.1  Questionnaire  Design 

To support user satisfaction evaluation we developed a satisfaction
questionnaire, REVU-IT (Report on the Enjoyment, Value, and
Usability of an Intelligent Tutor). It consists of 63 items which
cover all aspects of interaction with the tutoring system: the
clarity and usefulness of the reading material; the graphical user
interface to the circuit simulator; interaction with the dialogue
tutor; and the overall impression of the Beetle II system as a
whole. The reading material, graphical user interface and
interaction with the tutor sections are complementary, because they
cover separate parts of the Beetle II interface. We expect that all
three of these components contribute to the overall impression
score. For the purposes of this paper, we will focus on the part of
the questionnaire that relates to the natural language interaction
with the tutor (REVU-NL), and its relationship to the overall
impression score (REVU-OVERALL).

The REVU-IT questionnaire was developed by experienced cognitive
psychologists (two of the authors of this paper). The REVU-NL
section consists of 35 items shown in Appendix A. Its design was
guided by questionnaires used in previous research, including
INSPIRE and a questionnaire used to evaluate the ITSPOKE tutorial
dialogue system (Forbes-Riley et al., 2006). REVU-NL contains a
number of items from these, but omits items that are not relevant
to the Beetle II domain (e.g., “Domestic devices can be operated
efficiently with the system” or “The tutor responded effectively
after I was uncertain”), and adds extra questions related to
tutoring (e.g., “Our dialogues quickly led to me having a deeper
understanding of the material”), based on the authors’ previous
experience in human factors research. We also slightly rephrased
all questions to refer to “the tutor” rather than “the system”.

The REVU-OVERALL section of REVU-IT consists of 5 items assessing
the student’s satisfaction with their learning as a whole. The
questions are: “Overall, I am satisfied with my experience learning
about electricity from this system.”; “Working in this learning
environment was just like working one-on-one with a human tutor”;
“I would have preferred to learn about electricity in a different
way.”; “I would use this system again in the future to continue to
learn about electricity.”; “I would like to be able to use a system
like this to learn about other topics in the future.”. We use the
averaged score over these 5 items to represent the student’s
overall satisfaction with the learning environment, referring to it
as “overall satisfaction”.

Adding new questions to the REVU-NL questionnaire on top of already
existing questions is the initial step in addressing the issues
discussed in Section 2: validating the individual questions and
discovering the underlying dimensions of user satisfaction. Having
a large number of questions asking about the same aspects of the
interaction will allow us to group related questions together into
dimensions (“factors”), and also to discover ambiguous questions
that will need to be improved in future studies. A detailed
discussion of the technique and issues involved is presented in
Hone and Graham (2000).

4.2  Participants 

We used REVU-IT as part of a controlled experiment comparing the
BASE and FULL versions of the system. We recruited 87 participants
from a university in the Southern US, paid for participation.
Participants had little knowledge of the domain. Each participant
signed consent forms and completed a pre-test, then worked through
both lessons (with breaks), and then completed a post-test and a
REVU-IT questionnaire. Each session lasted 3.5 hours on average.

Out of 87 participants who completed the study, 13 had an
inordinate amount of trouble with the interface: they typed
utterances that could not be interpreted by the tutor (defined as
being more than 3 standard deviations away from the rest in
interpretation errors), did not follow the tutor’s instructions or
experienced system crashes. In addition, two participants were
learning gain outliers (again, more than 3 standard deviations from
the average). These participants were removed from the analysis.
The questionnaires from the remaining 72 participants are used in
our data analysis.
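The exclusion rule above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the phrase
"compared to the rest" is read here as a leave-one-out comparison
(each participant is compared against the mean and standard
deviation of all other participants), and the error rates are
invented for the example, not taken from the study.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return indices of values lying more than `threshold` standard
    deviations from the mean of all *other* values (leave-one-out)."""
    flagged = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]
        m, s = mean(rest), stdev(rest)
        if s > 0 and abs(x - m) > threshold * s:
            flagged.append(i)
    return flagged

# Hypothetical per-participant proportions of uninterpretable utterances.
error_rates = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.10, 0.55]
outliers = flag_outliers(error_rates)
kept = [x for i, x in enumerate(error_rates) if i not in outliers]
```

With these invented values only the last participant is flagged.
The leave-one-out reading matters for small samples, where a single
extreme value inflates the standard deviation it is measured
against.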

5  Analysis 

5.1  Underlying  satisfaction  dimensions 

Each item in the REVU-NL questionnaire used a 5-point Likert scale,
from “completely disagree” (1) to “fully agree” (5). Most of the
items were phrased so that agreement with the statement meant a
positive evaluation of the system. For a few items, however, the
polarity was reversed (e.g., “The tutor was not helpful”). Those
items were reverse-coded, with 1 meaning “fully agree” and 5
“completely disagree”, to ensure that a lower score on all
questions corresponds to a negative assessment.
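As a minimal sketch of the reverse-coding step, the mapping on a
5-point scale is simply 6 minus the raw score. The function names
below are hypothetical, and the set of negatively phrased items
used in the example is an assumption based on the shortened wording
in Table 1 ("Tutor was irritating", "Dialogues were boring"); the
responses are invented.

```python
def reverse_code(score, low=1, high=5):
    """Mirror a Likert score on its scale (1<->5, 2<->4, 3 unchanged)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside {low}..{high}")
    return low + high - score

def normalise(responses, negative_items):
    """Reverse-code negatively phrased items so that a higher score
    always means a more positive assessment."""
    return {
        item: reverse_code(score) if item in negative_items else score
        for item, score in responses.items()
    }

# Assumed negative items and one hypothetical student's raw answers.
negative_items = {"t28", "t30"}
raw = {"t7": 4, "t28": 2, "t30": 5}
coded = normalise(raw, negative_items)
```

After this step all items point in the same direction, which is
what makes averaging over the items of one factor meaningful.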

Following Hone and Graham (2000), we used exploratory factor
analysis to group questionnaire items into clusters representing
different dimensions. One of the standard approaches to determining
how many factors (“question clusters”) to use is the scree test,
which checks the number of eigenvalues in the question covariance
matrix that are greater than 1. These typically correspond to
principal components which reflect the underlying questionnaire
structure. The scree test showed 7 eigenvalues greater than 1,
resulting in the 7-factor solution presented in Table 1.
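The eigenvalue-greater-than-1 rule described above (widely known as
the Kaiser criterion, and commonly applied to the item correlation
matrix) can be illustrated as follows. This is a sketch, not the
analysis pipeline used in the study: the 4x4 correlation matrix is
hypothetical, and a small pure-Python Jacobi routine stands in for
the numerical library one would normally use.

```python
import math

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations,
    returned in descending order."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def count_factors(eigenvalues, cutoff=1.0):
    """Number of eigenvalues above the cutoff."""
    return sum(1 for ev in eigenvalues if ev > cutoff)

# Hypothetical correlations: items 1-2 and items 3-4 form two clusters.
corr = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]
eigenvalues = jacobi_eigenvalues(corr)
n_factors = count_factors(eigenvalues)
```

For this toy matrix two eigenvalues exceed 1, one per cluster of
correlated items, which is the intuition behind using the rule to
choose the number of factors.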

The loadings in Table 1 are the correlation coefficients between
the individual question scores and the variables representing the
factors. Most of the correlations are quite high, indicating that
the questions are strongly correlated both among themselves and
with the underlying factor. However, the last two factors contain
only non-loading questions according to the criteria in Hone and
Graham (2000), i.e., questions for which the correlations are too
weak to be reliable. In addition, factors 4 and 5 had fewer than 3
questions. Since the number of subjects in our data set is small,
such factors may not be reliable. Therefore, we focus our remaining
analysis on the top 3 factors from the questionnaire, each of which
contains 3 or more questions.

#  Question                                   Loading
1  t29: Knew what to say at each point        0.82
1  t22: Easy to interact with the tutor.      0.79
1  t9: Not sure what was expected.            0.73
1  t18: Knew what to say to the tutor.        0.70
1  t14: The tutor was too inflexible.         0.69
1  t19: Able to recover easily from errors    0.69
1  t24: Easy to learn to speak to tutor.      0.69
1  t16: Tutor didn’t do what I wanted.        0.65
1  t3: Tutor understood me well.              0.65
1  t15: Working as easy as with a human.      0.64
1  t13: Had to concentrate when talking.      0.62
2  t31: Tutor was an efficient way to learn.  0.79
2  t32: Easy to learn from the tutor.         0.78
2  t34: Tutor was worthwhile                  0.72
3  t28: Tutor was irritating.                 0.76
3  t10: Tutor was fun.                        0.74
3  t7: Enjoyed talking with tutor.            0.72
3  t30: Dialogues were boring.                0.66
4  t2: Tutor took too long to respond         0.84
4  t33: Tutor responded quickly               0.84
5  t26: Didn’t always understand tutor        0.89
6  (t3: The tutor understood me well)         0.4
7  (t25: Comfortable talking with tutor)      0.59

Table 1: Factors derived from the REVU-NL questionnaire, with
question loadings for the factor to which each question was
assigned. Question text is shortened due to space limitations; the
full text is presented in the appendix. Non-loading questions are
in parentheses.

Twelve questions in REVU-NL were “cross-loading” according to the
criteria in Hone and Graham (2000), that is, their two top loadings
differed by less than 0.2. This indicates questions that are likely
to be ambiguous, since they are strongly correlated with two
(theoretically independent) variables. Such questions should be
refined and re-designed in future surveys. These were questions t1,
t4, t6, t11, t12, t17, t20, t21, t23, t25, t27, t35 from the
appendix. We removed them from our solution, and discuss the
implications for survey design in Section 6.
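The cross-loading test stated above (top two loadings differing by
less than 0.2) is easy to express in code. The sketch below assumes
that loadings are compared by absolute value, which the text does
not specify, and uses invented loading vectors with hypothetical
item names rather than values from our analysis.

```python
def is_cross_loading(loadings, min_gap=0.2):
    """True if the two largest absolute loadings of an item differ by
    less than `min_gap`, i.e. the item is not clearly tied to one factor."""
    top, second = sorted((abs(x) for x in loadings), reverse=True)[:2]
    return top - second < min_gap

# Hypothetical loadings of three items on three factors.
item_loadings = {
    "item_a": [0.82, 0.10, 0.05],   # clearly belongs to factor 1
    "item_b": [0.55, 0.45, 0.10],   # split between factors 1 and 2
    "item_c": [-0.60, 0.15, 0.50],  # split between factors 1 and 3
}
ambiguous = sorted(
    name for name, loads in item_loadings.items() if is_cross_loading(loads)
)
```

Items flagged this way are the candidates for rewording or removal
in a later version of the questionnaire.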

The first component in our analysis lines up well with the
Transparency and Cognitive load factors from INSPIRE, and Response
accuracy, Cognitive demand and Habitability from SASSI, though it
was not split into individual factors as in those analyses. We will
refer to this factor as Transparency. The second component contains
questions specific to tutoring. However, it is similar to the
Acceptability dimension from INSPIRE (the original SASSI
questionnaire did not include similar questions), which asked users
to rate statements such as “domestic devices can be operated
efficiently with the system”. Thus, we will refer to it as
Acceptability. Finally, our third dimension lines up best with the
Affect and Annoyance items from SASSI.2 We will refer to it as
Affect.

Although the correspondences between our factors and those derived
from SASSI and INSPIRE are not perfect, the fact that similar
underlying factors are derived from different user groups and
systems indicates that they are likely to be measuring the same
underlying constructs.

5.2  Comparing  satisfaction  in  different  systems 

Recall that in this study we combined the data from two systems:
FULL, where the system provided students with adaptive feedback and
hints, and BASE, where the system simply acknowledged the student’s
answers and then provided a correct answer without engaging in
dialogue. Table 2 separates out the average factor scores for these
two conditions, where a factor score is computed by averaging over
scores of all questions assigned to that factor.
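A factor score as defined above is just the mean of the
(reverse-coded) item scores assigned to a factor. The following
sketch uses the item assignments for factors 2 and 3 from Table 1;
the function name and the single student's responses are
hypothetical.

```python
from statistics import mean

# Item assignments for two of the factors, as listed in Table 1.
FACTOR_ITEMS = {
    "Acceptability": ["t31", "t32", "t34"],
    "Affect": ["t28", "t10", "t7", "t30"],
}

def factor_scores(responses, factor_items=FACTOR_ITEMS):
    """Average the item scores assigned to each factor.
    `responses` maps item ids to already reverse-coded Likert scores."""
    return {
        factor: mean(responses[item] for item in items)
        for factor, items in factor_items.items()
    }

# One hypothetical student's reverse-coded responses.
student = {"t31": 4, "t32": 5, "t34": 3, "t28": 2, "t10": 3, "t7": 4, "t30": 3}
scores = factor_scores(student)
```

Averaging these per-student factor scores within each condition
gives the kind of per-condition means reported in Table 2.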

When comparing learning gain and overall satisfaction between the
two systems (the latter being the overall impression of the system
behavior as a whole, including circuit simulation and lesson
design), the difference is not statistically significant (learning
gain: t(69) = -0.95, p = 0.35; overall satisfaction: t(69) = -1.52,
p = 0.13). In contrast, on individual dimensions related to
tutoring the scores for BASE are significantly higher than the
scores for FULL (Transparency: t(69) = -7.19, p < 0.0001;
Acceptability: t(69) = -3.24, p < 0.01; Affect: t(69) = -1.97,
p = 0.05). Comparing the means, the biggest difference in student
ratings shows on the Transparency scale, while the affective
reaction for the two systems is more similar (though still rated
higher for BASE).

2 The acceptability dimension from INSPIRE is split between our
factors 2 and 3, but most of the questions correspond to our factor
2 questions.

               FULL          BASE
Transparency   2.15 (0.56)   3.36 (0.81)
Acceptability  3.11 (1.02)   3.80 (0.77)
Affect         2.43 (0.80)   2.86 (0.996)
Overall        3.39 (0.88)   3.70 (0.83)
Learning gain  0.61 (0.15)   0.65 (0.22)

Table 2: Average scores for different satisfaction dimensions in
FULL and BASE (standard deviation in parentheses).
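For readers unfamiliar with the statistic, a two-sample t value of
the kind reported above can be computed as in the sketch below. It
assumes a pooled-variance (Student) t-test and computes FULL minus
BASE, so that a negative t means BASE scored higher; the text does
not state which variant was used, and the two small samples are
invented.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Student's two-sample t statistic (pooled variance) and its
    degrees of freedom, computed as mean(a) - mean(b)."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled_var = (
        (na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)
    ) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df

# Hypothetical factor scores for a handful of students per condition.
full_scores = [2, 3, 2, 3]
base_scores = [3, 4, 3, 4]
t_value, dof = pooled_t(full_scores, base_scores)
```

The p-value is then read from the t distribution with `dof` degrees
of freedom, for which a statistics package would normally be used.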

It is somewhat unexpected to see that the students were equally
satisfied overall with both systems but rated the tutor in BASE
more highly than in FULL, since the tutor behavior was the only
thing different between conditions. We are at present investigating
the reasons for this result. One possibility is that when students
did not get much feedback from the tutor (as in BASE), other
factors became more important to overall satisfaction, such as
course design and quality of user simulation.

5.3  Relationships  between  subjective  and 
objective  outcome  measures 

We investigated the correlations between learning gain and
different user satisfaction factors for the two system versions.
Results are presented in Table 3. As can be seen from the table,
learning gain and user satisfaction are only significantly
correlated in FULL, and only for the acceptability and overall
satisfaction factors. None of the factors in the BASE system
correlate with learning gain. This indicates that the student’s
affective reaction to the system is not necessarily linked directly
to its objective benefits. We discuss these results further in
Section 6.
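The correlations discussed above can be illustrated with a short
sketch. It assumes Pearson's product-moment correlation, which the
text does not name explicitly, and uses invented data; the
accompanying t statistic is the standard transformation used to
test whether a correlation differs from zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical satisfaction factor scores and learning gains.
satisfaction = [1, 2, 3, 4, 5]
learning_gain = [2, 1, 4, 3, 5]
r = pearson_r(satisfaction, learning_gain)
t_stat = t_for_r(r, len(satisfaction))
```

With only a few dozen students per condition, a moderate
correlation can fall on either side of the 0.05 threshold, so the
difference between a significant and a non-significant entry should
be read with some caution.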

5.4  Impact  of  interpretation  quality  on  user 
satisfaction 

It is generally known in SDS research that measures of
interpretation quality such as word error rate and concept accuracy
are strongly correlated with user satisfaction (e.g., Walker et
al., 2000; Möller et al., 2007). Our system uses typed input and
produces complex logical representations (rather than simple
slot-value pairs); thus, these measures cannot be computed
directly. However, in an earlier study we showed that another
measure of interpretation quality, namely, the percentage of
utterances that could not be interpreted by the system
(“uninterpretable utterances”), is negatively correlated with
learning gain and user satisfaction (Dzikovska et al., 2010b).3

               FULL           BASE
Transparency   0.32 (0.07)    0.06 (0.69)
Acceptability  0.38 (0.03)*   0.23 (0.16)
Affect         0.29 (0.08)    -0.10 (0.53)
Overall        0.38 (0.02)*   0.18 (0.28)

Table 3: Correlations between satisfaction factors and learning
gain for two dialogue policies. Significance level in parentheses.
An asterisk indicates significance at the p < 0.05 level.

That study revealed an unexpected pattern. Although the system
recorded the number of utterances it could not interpret in both
FULL and BASE, students in BASE were never informed of any
interpretation problems. Nevertheless, the proportion of such
uninterpretable utterances was still significantly negatively
correlated with user satisfaction in BASE. After analyzing
correlations between different types of errors and user
satisfaction, we hypothesized that this can be explained by the
lack of alignment between the system and the student, in particular
when students used terminology different from that used by the
system (Dzikovska et al., 2010b).

We can now analyze this relationship in more detail, looking at
correlations between interpretation problems and different
components of user satisfaction. The results are presented in
Table 4.

As can be seen from Table 4, the proportion of uninterpretable
answers is significantly correlated with Acceptability in FULL, but
not in BASE. This is not surprising, indicating that students who
were told that they were not understood perceived the system as
less useful for them. More surprisingly, Transparency, which is
related to perceived ease of use for the system, was not correlated
with uninterpretable utterances. Finally, the proportion of
uninterpretable utterances is significantly correlated with Affect
for both systems. Moreover, the unexpected negative correlation we
observed in the earlier study between satisfaction with the tutor
and interpretation problems in BASE can be primarily attributed to
the negative correlation with the Affect score.

3 In that study, we computed user satisfaction with the tutor by
averaging over all 35 questions in our questionnaire as an initial
approximation.

               FULL              BASE
Transparency   -0.28 (0.1)       -0.25 (0.10)
Acceptability  -0.58 (< 0.001)   -0.29 (0.07)
Affect         -0.35 (0.04)      -0.34 (0.04)
Overall        -0.38 (0.03)      -0.27 (0.11)
Learning gain  -0.38 (0.03)      -0.09 (0.60)

Table 4: Correlations between satisfaction factors and
uninterpretable utterances for two different policies. Significance
level in parentheses.

6  Discussion 

In this study, we attempted to apply insights from studies of user
satisfaction in spoken dialogue systems to a different type of
dialogue application: tutorial dialogue. We were looking to develop
a better user satisfaction questionnaire for evaluating tutorial
dialogue systems, and to implement an evaluation methodology which
takes into account different underlying dimensions of user
satisfaction.

The three dimensions we obtained based on exploratory factor
analysis of REVU-NL align well with the dimensions reported in the
SDS literature, which provides some evidence of their validity.
However, the results are preliminary because of the small number of
participants involved, and need to be replicated with additional
participants and different tutoring systems. Regardless, our
analysis highlighted important issues in designing satisfaction
surveys for different dialogue genres.

When choosing which questions to include in a satisfaction
questionnaire for a new system type, SASSI is a very attractive
starting point, because it was validated across multiple SDS in two
genres (command and control and information seeking). This also
means that SASSI items are phrased very generally and are therefore
easier to adapt. In contrast, INSPIRE contains a number of
questions specific to the command and control domain, asking
whether the user thinks the system is useful in achieving their
goals (i.e., operating the domestic devices). SASSI includes only
one similar item, “The system was useful”. It was classed as
Affect, most likely because there were no other similar items.
However, we think that such questions represent an important
separate dimension, namely the “perceived usefulness” factor known
to predict technology acceptance (Adams et al., 1989). Therefore we
included several items in REVU-NL with similar intent, asking
whether users thought the system was beneficial to their goal
(i.e., learning the material). These items were clustered into a
separate dimension by factor analysis, indicating that they should
be included in other satisfaction surveys.

Moreover, some of the questions that appeared
genre-independent to us proved to be cross-loading
in our analysis, which is an indicator of ambiguity.
Apparently, some of the items from task-oriented dialogue
questionnaires did not transfer well. For example,
statements like “The system didn’t always do
what I expected” are unambiguous for task-oriented
dialogue, where the user is supposed to be in control
of the interaction, and therefore has clear expectations
of what the system should do. In contrast, in
tutorial dialogue the tutor has control over the learning
material. Thus, it may be more ambiguous as
to what, if anything, students are expecting from the
interaction.
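
As an illustration only, one common way to flag cross-loading items is to look for items whose absolute loading is salient on two or more factors. The sketch below assumes a salience threshold of 0.4; the threshold, the function name and the loading values are hypothetical and are not values reported in this paper.

```python
def cross_loading_items(loadings, threshold=0.4):
    """Return ids of items whose absolute loading reaches the threshold
    on at least two factors (the 0.4 default is an assumed cut-off)."""
    flagged = []
    for item, row in loadings.items():
        salient = sum(1 for value in row if abs(value) >= threshold)
        if salient >= 2:
            flagged.append(item)
    return flagged


# Made-up loadings on three factors, for illustration only.
example_loadings = {
    "t4": [0.45, 0.48, 0.10],   # salient on two factors: ambiguous
    "t7": [0.05, 0.82, 0.12],   # salient on one factor only
    "t32": [0.10, 0.15, 0.77],  # salient on one factor only
}

print(cross_loading_items(example_loadings))
```

Items flagged in this way would be candidates for rewording or removal rather than being assigned to a single dimension.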

Overall, our experience shows that it may not
be possible, or indeed useful, to create completely
generic surveys. However, we believe that questionnaires
can be phrased generally enough to apply to a
range of systems with similar goals, and that REVU-NL
in particular is a useful starting point for comparing
dialogue-based tutoring systems. We believe that the
18 questions that we retained as unambiguous in our
analysis provide an adequate assessment of user satisfaction,
and are grouped into factors consistent with
the results of previous research. However, the questionnaire
could be further improved by revisiting the
cross-loading items we rejected as ambiguous, and
seeing whether their wording could be improved. We
also intend to use REVU-IT in evaluating a spoken
version of Beetle II, thus providing additional
validation data on a different version of the interface.

With respect to evaluation methodology, our results
highlight the need to look at different satisfaction
dimensions separately. We used our factors
to further investigate a pattern that we discovered
in previous research, namely, that students who
speak in a way that is difficult for the system to interpret
tend to be less satisfied with the tutor, even
when they are not told of the interpretation problems.
Looking at correlations with individual dimensions
shows that this relationship is primarily
explained by the Affect dimension. Our working hypothesis
is that the lack of alignment between incorrect
student answers and the answers supplied by
the system caused students to perceive the system as
a less likeable or cooperative conversational partner.
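
A minimal sketch of this kind of per-dimension analysis is given below, assuming that a dimension score is the mean rating of the items in that dimension and that the Pearson coefficient is used. The item ids assigned to Affect, the ratings and the proportions of uninterpretable utterances are all invented for illustration; they are not data from the study.

```python
from math import sqrt


def dimension_score(ratings, items):
    """Mean rating over the items assigned to one satisfaction dimension."""
    return sum(ratings[item] for item in items) / len(items)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical item membership; the real grouping comes from factor analysis.
AFFECT_ITEMS = ["t7", "t10"]

# Invented data: (questionnaire ratings, proportion of uninterpretable utterances).
students = [
    ({"t7": 5, "t10": 4}, 0.05),
    ({"t7": 4, "t10": 4}, 0.10),
    ({"t7": 3, "t10": 2}, 0.30),
    ({"t7": 2, "t10": 3}, 0.20),
]

affect_scores = [dimension_score(ratings, AFFECT_ITEMS) for ratings, _ in students]
uninterpretable = [proportion for _, proportion in students]
r = pearson(affect_scores, uninterpretable)
print(round(r, 2))
```

Repeating the same computation for each dimension separately is what allows a relationship with overall satisfaction to be traced to one dimension. Negatively worded items would presumably need to be reverse-scored before averaging, which this sketch does not do.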

We also observed that Acceptability, but no other
dimension, was correlated with learning gain in
FULL. One possible explanation is that students who
are learning more believe that the system is helping
them reach their goals (our definition of Acceptability).
The FULL condition provides students with
more explicit feedback as to their learning, whereas
in BASE students may have a less accurate estimate
of how well they are doing, and hence no satisfaction
dimensions are correlated with learning gain.

It is worth noting that an earlier study investigating
the relationship between user satisfaction and
learning in two different tutorial dialogue systems
(Forbes-Riley and Litman, 2009) found little correlation
between the answers to individual questions
on their satisfaction questionnaire and learning gain.
Only one correlation, with the question “The tutor
helped me to concentrate”, reached significance, and
in only one of the 4 conditions they investigated. This
adds further evidence that the relationship between
learning gain and satisfaction is not straightforward.
However, our results are difficult to compare since
the questionnaires used are different, and Forbes-Riley
and Litman (2009) study correlations
with individual questions rather than grouping related
questions together. Developing better validated
questionnaires will make such results easier to compare
and interpret, and we believe that REVU-NL
makes a significant step in that direction.

7  Conclusion  and  Future  Work 

In this paper, we proposed an improved questionnaire
(REVU-NL) for evaluating user satisfaction
in tutorial dialogue systems, which is an important
evaluation metric alongside learning gain. We used
the methodology from SDS evaluations to investigate
different dimensions of user satisfaction, and
their relationship to learning gain and different
interaction properties. Next, we are planning to use
the PARADISE methodology to establish predictive
models that relate satisfaction dimensions to measurable
interaction properties, so that we can determine
development priorities, and make it easier
to compare different system versions. We are
also planning to collect additional questionnaire data
with a speech-enabled version of the system, and
verify our analyses on this extended data set.

Acknowledgments 

This work has been supported in part by US Office
of Naval Research grants N000141010085 and
N0001410WX20278. We would like to thank our
sponsors from the Office of Naval Research, Dr. Susan
Chipman and Dr. Ray Perez, and the Research
Associates who worked on this project, Katherine
Harrison, Leanne Taylor, Charles Scott, Simon
Caine, Elaine Farrow and Charles Callaway for their
contribution to this effort.

References 

Dennis A. Adams, R. Ryan Nelson, and Peter A. Todd.
1989. Perceived usefulness, ease of use, and usage of
information technology. MIS Quarterly, 13:319-339.

Myroslava Dzikovska, Natalie B. Steinhauser, Johanna D.
Moore, Gwendolyn E. Campbell, Katherine M.
Harrison, and Leanne S. Taylor. 2010a. Content,
social, and metacognitive statements: An empirical
study comparing human-human and human-computer
tutorial dialogue. In Sustaining TEL: From
Innovation to Learning and Practice - 5th European
Conference on Technology Enhanced Learning (EC-TEL
2010), pages 93-108, Barcelona, Spain, October.

Myroslava O. Dzikovska, Johanna D. Moore, Natalie
Steinhauser, and Gwendolyn Campbell. 2010b. The
impact of interpretation problems on tutorial dialogue.
In Proceedings of the 48th Annual Meeting of the
Association for Computational Linguistics (ACL-2010),
Uppsala, Sweden, July.

Myroslava O. Dzikovska, Johanna D. Moore, Natalie
Steinhauser, Gwendolyn Campbell, Elaine Farrow,
and Charles B. Callaway. 2010c. Beetle II: a system
for tutoring and computational linguistics experimentation.
In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics (ACL-2010)
demo session, Uppsala, Sweden, July.

Katherine Forbes-Riley and Diane J. Litman. 2009.
Adapting to student uncertainty improves tutoring dialogues.
In Artificial Intelligence in Education: Building
Learning Systems that Care: From Knowledge
Representation to Affective Modelling, Proceedings
of the 14th International Conference on Artificial
Intelligence in Education (AIED 2009), pages 33-40,
Brighton, UK, July.

Katherine Forbes-Riley and Diane J. Litman. 2011.
Designing and evaluating a wizarded uncertainty-adaptive
spoken dialogue tutoring system. Computer
Speech & Language, 25(1):105-126.

Katherine Forbes-Riley, Diane J. Litman, Scott Silliman,
and Joel R. Tetreault. 2006. Comparing synthesized
versus pre-recorded tutor speech in an intelligent
tutoring spoken dialogue system. In Proceedings of
the Nineteenth International Florida Artificial
Intelligence Research Society Conference, pages 509-514,
Melbourne Beach, Florida, USA, May.

Kate S. Hone and Robert Graham. 2000. Towards a
tool for the subjective assessment of speech system
interfaces (SASSI). Natural Language Engineering,
6(3&4):287-303.

G. Tanner Jackson, Arthur C. Graesser, and Danielle S.
McNamara. 2009. What students expect may have
more impact than what they know or feel. In Proceedings
of the 14th International Conference on Artificial
Intelligence in Education (AIED), Brighton, UK.

Lars Bo Larsen. 2003. Issues in the evaluation of spoken
dialogue systems using objective and subjective
measures. In Proceedings of 2003 IEEE Workshop
on Automatic Speech Recognition and Understanding
(ASRU’03), pages 209-214, December.

Diane Litman, Johanna Moore, Myroslava Dzikovska,
and Elaine Farrow. 2009. Using natural language
processing to analyze tutorial dialogue corpora across
domains and modalities. In Proceedings of 14th
International Conference on Artificial Intelligence in
Education (AIED), Brighton, UK, July.

Joel Michael, Allen Rovick, Michael Glass, Yujian Zhou,
and Martha Evens. 2003. Learning from a computer
tutor with natural language capabilities. Interactive
Learning Environments, 11:233-262(30).

Sebastian Möller, Paula Smeele, Heleen Boland, and Jan
Krebber. 2007. Evaluating spoken dialogue systems
according to de-facto standards: A case study. Computer
Speech & Language, 21(1):26-53.

Amruta Purandare and Diane Litman. 2008. Content-learning
correlations in spoken tutoring dialogs at
word, turn and discourse levels. In Proceedings of
the 21st International FLAIRS Conference, Coconut
Grove, Florida, May.

Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman.
2000. Towards Developing General Models of
Usability with PARADISE. Natural Language Engineering,
6(3).

Maria Wolters, Kallirroi Georgila, Robert Logie, Sarah
MacPherson, Johanna Moore, and Matt Watson. 2009.
Reducing working memory load in spoken dialogue
systems. Interacting with Computers, 21(4):276-287.


A  REVU-NL  Questions 


t1 I felt in control of my conversations with the tutor.
t2 It took the tutor too long to respond to my statements.
t3 I felt that the tutor understood me well.
t4 The tutor didn't always do what I expected.
t5 The information that the tutor provided to me was incomplete.
t6 It was easy for me to become confused during our dialogue.
t7 I enjoyed talking with the tutor.
t8 The tutor interfered with my understanding of the topics in electricity and circuits.
t9 I was not always sure what the tutor expected of me.
t10 Conversing with the tutor was fun.
t11 It was easy to understand the things that the tutor said.
t12 The dialogue between me and the tutor was very repetitive.
t13 I had to really concentrate when I was talking with the tutor.
t14 The tutor was too inflexible.
t15 Working through the lessons with the computer tutor was as easy as working through the lessons with a human tutor.
t16 The tutor didn’t always do what I wanted.
t17 I felt confident when talking with the tutor.
t18 I always knew what to say to the tutor.
t19 I was able to recover easily from errors during our dialogues.
t20 Talking with the tutor was frustrating.
t21 The information provided by the tutor was clear.
t22 It was easy to interact with the tutor.
t23 The tutor’s dialogue was clumsy and unnatural.
t24 It was easy to learn how to speak to the tutor in a way that the tutor understood.
t25 I felt comfortable talking with the tutor.
t26 I didn’t always understand what the tutor meant.
t27 The tutor was not helpful.
t28 I found conversing with the tutor to be irritating.
t29 I knew what I could say or do at each point in the conversation with the tutor.
t30 I found our dialogues to be boring.
t31 Having the tutor help me with the material was an efficient way to learn.
t32 It was easy to learn from the tutor.
t33 The tutor responded quickly.
t34 Having the tutor was worthwhile.
t35 Our dialogues quickly led to me having a deeper understanding of the material.

B  REVU-OVERALL  questions

o1 Overall, I am satisfied with my experience learning about electricity from this system.
o2 Working in this learning environment was just like working one-on-one with a human tutor.
o3 I would have preferred to learn about electricity in a different way.
o4 I would use this system again in the future to continue to learn about electricity.
o5 I would like to be able to use a system like this to learn about other topics in the future.


C  REVU-IT  questions  related  to  GUI  and  reading  material  (mentioned  but  not  analyzed 
in  the  paper) 

sl1 It was easy to navigate through the slides.
sl2 It took a long time for each new slide to be displayed.
sl3 The material on the slides was easy to understand.
sl4 The material on the slides was poorly written.
sl5 I would have benefited from more instruction on how to move through the slides.
sl6 The material on the slides was interesting.
sl7 The slide navigation buttons didn't always work the way I expected them to.
sl8 The slides were annoying.
sl9 The material on the slides was written at a level far beneath my abilities.
sl10 I would prefer reading a text book over reading these slides.

e1 I found it difficult to learn how to build circuits and take measurements in the workspace.
e2 Completing exercises in the workspace was fun.
e3 Before beginning the lesson, I received the right amount of instruction on how to build circuits in the workspace and take measurements.
e4 The exercises were well designed to illustrate the important lesson concepts.
e5 Sometimes I didn’t understand what I was supposed to do for an exercise.
e6 The method for connecting components with wires was counter-intuitive.
e7 Having to build all those circuits was annoying.
e8 I always knew exactly what to build and/or measure in the workspace, and how to do it.
e9 Circuits loaded quickly.
e10 Even if I didn’t predict the outcome correctly ahead of time, once I completed an exercise, I always understood the point.
e11 It was easy to use the meter.
e12 There were more exercises than necessary to cover the lesson topics.
e13 I would have learned more if I had been able to build circuits with actual light bulbs and batteries.


